Use vectorcall for all-positional-argument calls #5896
base: master
Conversation
If a handle or object is called with only positional arguments, it is
straightforward to use PyObject_Vectorcall instead of
PyObject_CallObject.
Benchmarked by adding a trivial function to pybind11_benchmark:
```cpp
m.def("call_func_with_int", [](py::object func) {
    return func(py::cast(1));
});
```
and then running `python -m timeit --setup 'from pybind11_benchmark import call_func_with_int; f = lambda x: x + 1' 'call_func_with_int(f)'`.
Before on M4 mac: 57.6 nsec per loop
After on M4 mac: 48.4 nsec per loop
For comparison, the included collatz benchmark takes 33.1 nsec per
loop, just calling `f(1)` directly takes 17.8 nsec per loop, and simply
running `pass` takes 4.19 nsec per loop.
include/pybind11/cast.h
Outdated
```cpp
// Disable warnings about useless comparisons when N == 0.
PYBIND11_WARNING_PUSH
PYBIND11_WARNING_DISABLE_GCC("-Wtype-limits")
PYBIND11_WARNING_DISABLE_INTEL(186)
```
not sure why suppressing the icc warning didn't work :(
FWIW, I attempted to extend this to use vectorcall for the unpacking/kwargs/etc. cases, but my naive/straightforward attempt added too much cost to kwarg handling to actually improve performance. (The tuple-unpacking case in particular is bottlenecked on args_proxy having to use slow PyIter-based iteration instead of PyTuple_GET_ITEM as currently written, so we couldn't help there either, even though it doesn't use kwargs.)
```diff
 /// Helper class which collects only positional arguments for a Python function call.
 /// A fancier version below can collect any argument, but this one is optimal for simple calls.
-template <return_value_policy policy>
+template <size_t N, return_value_policy policy>
```
IIUC there are two downsides with this approach:
- Code size increases because there will be a template instantiation for each N.
- Code complexity increases significantly, making future maintenance more difficult.

The obvious question: is it really worth it for real-world situations?
```diff
 private:
-    tuple m_args;
+    std::array<PyObject *, N> m_args;
```
Could we get away from the N template instantiations by having a maximum N (capacity) and runtime num_args?
```cpp
/// Collect only positional arguments for a Python function call
template <return_value_policy policy,
          typename... Args,
          typename = enable_if_t<args_are_all_positional<Args...>()>>
```
Connected to the capacity idea above: enable_if only if <= capacity, is that feasible?
I think we don't have to be stingy with the capacity, e.g. 128 (pointers) would seem totally fine, so in practice this would hardly ever go to the fallback.